Sufficient Dimensionality Reduction with Irrelevance Statistics
Authors
Amir Globerson, Gal Chechik, Naftali Tishby
Abstract
The problem of unsupervised dimensionality reduction of stochastic variables while preserving their most relevant characteristics is fundamental for the analysis of complex data. Unfortunately, this problem is ill defined since natural datasets inherently contain alternative underlying structures. In this paper we address this problem by extending the recently introduced "Sufficient Dimensionality Reduction" feature extraction method [7], to use "side information" about irrelevant structures in the data. The use of such irrelevance information was recently successfully demonstrated in the context of clustering via the Information Bottleneck method [1]. Here we use this side-information framework to identify continuous features whose measurements are maximally informative for the main data set, but carry as little information as possible on the irrelevance data set. In statistical terms this can be understood as extracting statistics which are maximally sufficient for the main dataset, while simultaneously maximally ancillary for the irrelevance dataset. We formulate this problem as a tradeoff optimization problem and describe its analytic and algorithmic solutions. Our method is demonstrated on a synthetic example and on a real world application of face images, showing its superiority over other methods such as Oriented Principal Component Analysis.
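As a reading aid, and not a formula quoted from the paper itself, the tradeoff described in the abstract can be sketched as a single objective. Here φ denotes the extracted feature map, Y+ the variable of the main dataset, Y- the variable of the irrelevance dataset, and γ ≥ 0 a tradeoff parameter; all of this notation is illustrative:

\max_{\phi} \; \mathcal{L}(\phi) \;=\; I\bigl(\phi(X);\, Y^{+}\bigr) \;-\; \gamma\, I\bigl(\phi(X);\, Y^{-}\bigr)

Maximizing the first term drives φ(X) toward sufficiency for the main dataset, while the subtracted term drives it toward ancillarity for the irrelevance dataset; γ controls the balance between the two goals.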
Similar resources
Sufficient Dimensionality Reduction with Irrelevance Statistics
The problem of unsupervised dimensionality reduction of stochastic variables while preserving their most relevant characteristics is fundamental for the analysis of complex data. Unfortunately, this problem is ill defined since natural datasets inherently contain alternative underlying structures. In this paper we address this problem by extending the recently introduced “Sufficient Dimensional...
A sequential test for variable selection in high dimensional complex data
Given a high dimensional p-vector of continuous predictors X and a univariate response Y , principal fitted components (PFC) provide a sufficient reduction of X that retains all regression information about Y in X while reducing the dimensionality. The reduction is a set of linear combinations of all the p predictors, where with the use of a flexible set of basis functions, predictors related t...
A Monte Carlo-Based Search Strategy for Dimensionality Reduction in Performance Tuning Parameters
Redundant and irrelevant features in high dimensional data increase the complexity in underlying mathematical models. It is necessary to conduct pre-processing steps that search for the most relevant features in order to reduce the dimensionality of the data. This study made use of a meta-heuristic search approach which uses lightweight random simulations to balance between the exploitation of ...
Approximate Nearest Neighbor Regression in Very High Dimensions
Fast and approximate nearest-neighbor search methods have recently become popular for scaling nonparameteric regression to more complex and high-dimensional applications. As an alternative to fast nearest neighbor search, training data can also be incorporated online into appropriate sufficient statistics and adaptive data structures, such that approximate nearestneighbor predictions can be acc...
Additive Regression Splines With Irrelevant Categorical and Continuous Regressors
We consider the problem of estimating a relationship using semiparametric additive regression splines when there exist both continuous and categorical regressors, some of which are irrelevant but this is not known a priori. We show that choosing the spline degree, number of subintervals, and bandwidths via cross-validation can automatically remove irrelevant regressors, thereby delivering ‘auto...
Journal: CoRR
Volume: abs/1212.2483
Issue: -
Pages: -
Publication date: 2011